Diffusion Model

Diffusion Models are a class of generative models that simulate the process of data being gradually corrupted by noise (forward diffusion), and then learn the reverse process to recover data from noise.


1. Core Idea

Diffusion models consist of two core processes:

| Process | Direction | Description |
|---|---|---|
| Forward Diffusion | Data → Noise | Gradually add Gaussian noise to data until it becomes pure noise |
| Reverse Denoising | Noise → Data | Learn to gradually recover original data from noise |

[!NOTE] Physical Analogy
This mirrors diffusion in thermodynamics: a drop of ink in water spreads until it is uniformly distributed. The reverse process “condenses” the uniform distribution back into the initial state.


2. Forward Diffusion Process

2.1 Discrete-Time Formulation (DDPM)

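Given a data point $x_0 \sim q(x)$, gradually add Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

where $\beta_t \in (0,1)$ is the noise schedule, typically satisfying $\beta_1 < \beta_2 < \dots < \beta_T$.

After $T$ steps, $x_T$ approximately follows the standard Gaussian distribution $\mathcal{N}(0, I)$.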

Common Noise Schedules:

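| Schedule | Formula | Characteristics |
|---|---|---|
| Linear | $\beta_t = \beta_1 + (t-1)\frac{\beta_T - \beta_1}{T-1}$ | Simple, widely used |
| Cosine | $\bar\alpha_t = \frac{f(t)}{f(0)}$, $f(t) = \cos\left(\frac{t/T + s}{1+s}\cdot\frac{\pi}{2}\right)^2$ | Better for small $t$ |
| Quadratic | $\beta_t = \left(\sqrt{\beta_1} + (t-1)\frac{\sqrt{\beta_T} - \sqrt{\beta_1}}{T-1}\right)^2$ | Slower initial noise |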
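
The linear and cosine schedules from the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the defaults `beta_1=1e-4`, `beta_T=0.02`, and `s=0.008` are commonly used values assumed here, and the function names are hypothetical.

```python
import math

def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linear schedule: beta_t rises evenly from beta_1 to beta_T (t = 1..T)."""
    return [beta_1 + (t - 1) * (beta_T - beta_1) / (T - 1) for t in range(1, T + 1)]

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: alpha_bar_t = f(t) / f(0), for t = 0..T."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def alpha_bar_from_betas(betas):
    """Cumulative product of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

betas = linear_betas(1000)
abar_linear = alpha_bar_from_betas(betas)   # decays from ~1 toward 0
abar_cosine = cosine_alpha_bar(1000)        # starts at exactly 1, ends near 0
```

Plotting `abar_linear` against `abar_cosine` is a quick way to compare how fast each schedule destroys signal.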

2.2 Reparameterization Trick

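Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$, then:

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Key Property: We can sample $x_t$ at any timestep directly without iterating:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$$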
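
A small sketch of this one-shot sampling property, assuming scalar-list data and an arbitrary illustrative value $\bar\alpha_t = 0.5$. The helper name `q_sample` is hypothetical. The empirical mean and variance should land close to the closed-form values.

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; x0 is a list of floats."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
alpha_bar_t = 0.5
samples = [q_sample([2.0], alpha_bar_t, rng)[0][0] for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# Theory: mean = sqrt(alpha_bar_t) * x_0, variance = 1 - alpha_bar_t
```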

2.3 Continuous-Time Formulation ([[Stochastic Differential Equation (SDE)|SDE]])

The forward process can be written as a [[Stochastic Differential Equation (SDE)|Stochastic Differential Equation]]:

$$dx = f(t)\,x\,dt + g(t)\,dW_t$$

where $W_t$ is a [[Wiener Process|Wiener Process]], $f(t)$ is the drift coefficient, and $g(t)$ is the diffusion coefficient.

Discrete-Continuous Correspondence:

$$\bar\alpha_t = \exp\left(-\int_0^t \beta(s)\,ds\right)$$

3. Reverse Denoising Process

3.1 Discrete-Time Formulation

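The reverse process is also modeled as a Gaussian distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

A neural network learns the mean $\mu_\theta$ and variance $\Sigma_\theta$.

Optimal Reverse Distribution (when $\beta_t \to 0$):

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t,\ \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t I\right)$$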
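
The posterior coefficients are easy to get wrong, so a scalar sketch with a consistency check may help. The function name and the numeric values are illustrative assumptions. The check uses the identity that substituting $x_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\epsilon)/\sqrt{\bar\alpha_t}$ turns the posterior mean into $(x_t - \beta_t \epsilon / \sqrt{1-\bar\alpha_t})/\sqrt{\alpha_t}$.

```python
import math

def posterior_params(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for scalar inputs."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    mean = coef_x0 * x0 + coef_xt * x_t
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return mean, var

# Illustrative values; alpha_bar_t must equal alpha_t * alpha_bar_{t-1}.
alpha_t, alpha_bar_prev = 0.98, 0.60
alpha_bar_t = alpha_t * alpha_bar_prev
x0, eps = 1.5, -0.3
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps

mean, var = posterior_params(x_t, x0, alpha_t, alpha_bar_t, alpha_bar_prev)
# Same mean, written in terms of the noise instead of x_0.
mean_eps_form = (x_t - (1 - alpha_t) / math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_t)
```

The noise form of the mean is the one the DDPM sampling loop uses, with $\epsilon$ replaced by the network prediction.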

3.2 Simplified Training Objective (DDPM)

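Ho et al. (2020) proposed a simplified loss function:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
  • $\epsilon \sim \mathcal{N}(0, I)$ : true noise
  • $\epsilon_\theta(x_t, t)$ : noise predicted by the neural network

Full Variational Lower Bound:

$$L_{\text{VLB}} = \mathbb{E}_q\left[D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^{T} D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1)\right]$$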

3.3 Continuous-Time Formulation (Score-based)

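Reverse [[Stochastic Differential Equation (SDE)|SDE]]:

$$dx = \left[f(t)\,x - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar W_t$$

where $\bar W_t$ is a [[Wiener Process|Wiener Process]] in reverse time, and $\nabla_x \log p_t(x)$ is the [[Score Function|Score Function]].

[[Score Function]] Estimation:

The [[Score Function]] is learned via score matching:

$$L(\theta) = \frac{1}{2}\,\mathbb{E}_t\,\mathbb{E}_{x_0}\,\mathbb{E}_{x_t \mid x_0}\left[\|s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0)\|^2\right]$$

where $\nabla_{x_t} \log q(x_t \mid x_0) = -\dfrac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}$.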
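
A short worked step connects this target to the noise-prediction view of DDPM. Substituting $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ into the conditional score gives, under the discrete-time notation used earlier:

$$
\begin{aligned}
\nabla_{x_t}\log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
   = -\frac{\sqrt{1-\bar\alpha_t}\,\epsilon}{1-\bar\alpha_t}
   = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}, \\
s_\theta(x_t, t) &\approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}} .
\end{aligned}
$$

So a noise-prediction network and a score network differ only by a known, timestep-dependent scale factor.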


4. Core Formula Summary

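[!QUOTE] DDPM Forward Noising

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

[!QUOTE] DDPM Simplified Loss

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

[!QUOTE] Forward [[Stochastic Differential Equation (SDE)|SDE]]

$$dx = f(t)\,x\,dt + g(t)\,dW_t$$

[!QUOTE] Reverse [[Stochastic Differential Equation (SDE)|SDE]]

$$dx = \left[f(t)\,x - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar W_t$$

[!QUOTE] [[Probability Flow ODE]]

$$dx = \left[f(t)\,x - \frac{1}{2}\,g(t)^2\,\nabla_x \log p_t(x)\right]dt$$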

5. Main Variants

| Model | Features | Key Contributions |
|---|---|---|
| DDPM | Discrete-time, pixel space | Established the basic framework of diffusion models |
| DDIM | Deterministic sampling, accelerated generation | Non-Markovian forward process, supports skip-step sampling |
| Score [[Stochastic Differential Equation (SDE)\|SDE]] | Continuous-time [[Stochastic Differential Equation (SDE)\|SDE]] framework | Unified DDPM and Score Matching |
| LDM | Latent space diffusion | Perform diffusion in VAE latent space, reducing computation |
| DiT | Transformer architecture | Use Transformer instead of U-Net |
| EDM | Improved design choices | Better architecture, sampling, and training |
| Stable Diffusion | Text-conditional LDM | Cross-attention for text guidance, widely adopted |

5.1 DDIM (Denoising Diffusion Implicit Models)

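DDIM generalizes DDPM to non-Markovian processes:

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t z$$

where $z \sim \mathcal{N}(0, I)$, $\sigma_t = \eta\,\sqrt{\dfrac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}}\,\sqrt{1-\dfrac{\bar\alpha_t}{\bar\alpha_{t-1}}}$, and $\eta \in [0,1]$:

  • $\eta = 1$ : DDPM (stochastic)
  • $\eta = 0$ : DDIM (deterministic)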

Key Advantage: Can use fewer timesteps (e.g., 50 instead of 1000) for faster sampling.
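
To see why skipping steps is possible, here is a scalar sketch of one DDIM update. The function name and the numeric values are illustrative assumptions, and the "perfect" noise prediction stands in for a trained network. With $\eta = 0$ and an exact noise estimate, a single large jump lands exactly on the forward-process point at the earlier timestep.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=0.0, z=0.0):
    """One DDIM update x_t -> x_{t-1} for scalars; z is a standard normal draw."""
    sigma = (eta * math.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t))
                 * math.sqrt(1 - alpha_bar_t / alpha_bar_prev))
    # Predicted clean sample, then re-noise it to the earlier timestep.
    x0_pred = (x_t - math.sqrt(1 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    direction = math.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps_pred
    return math.sqrt(alpha_bar_prev) * x0_pred + direction + sigma * z

x0, eps = 1.0, 0.5
abar_t, abar_prev = 0.2, 0.9   # two far-apart timesteps
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)          # eta = 0, deterministic
expected = math.sqrt(abar_prev) * x0 + math.sqrt(1 - abar_prev) * eps
```

In practice the network prediction is imperfect, so larger skips trade some quality for speed.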

5.2 LDM (Latent Diffusion Models)

Instead of diffusing in pixel space, LDM operates in latent space:

  1. Compress: $z = \mathcal{E}(x)$ using VAE encoder
  2. Diffuse: Apply diffusion process to $z$
  3. Decode: $\hat{x} = \mathcal{D}(z_0)$ using VAE decoder

Benefits:

  • Lower dimensionality (e.g., 64×64×4 vs 512×512×3 )
  • Faster training and inference
  • Perceptual compression preserves semantic information
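
The dimensionality saving in the first bullet can be checked with quick arithmetic, using the example shapes quoted above:

```python
pixel_dims = 512 * 512 * 3    # RGB image in pixel space
latent_dims = 64 * 64 * 4     # latent tensor from the example above
ratio = pixel_dims / latent_dims
print(pixel_dims, latent_dims, ratio)  # 786432 16384 48.0
```

Each denoising step therefore operates on 48 times fewer values than it would in pixel space.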

5.3 DiT (Diffusion Transformers)

Replace U-Net with Transformer architecture:

  • Patching: Split image into patches (like ViT)
  • Self-attention: Capture long-range dependencies
  • Scaling: Better performance with larger models
  • Flexibility: Easy to incorporate conditioning
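
The patching step can be illustrated on a tiny single-channel grid. This is a simplified sketch with a hypothetical `patchify` helper: a real DiT patchifies a multi-channel latent and follows it with a learned linear projection, both omitted here.

```python
def patchify(image, p):
    """Split an H x W grid (list of rows) into non-overlapping p x p patches.

    Returns flattened patches in row-major order, i.e. the token sequence
    a ViT-style model would consume before linear projection.
    """
    H, W = len(image), len(image[0])
    assert H % p == 0 and W % p == 0, "patch size must divide image size"
    tokens = []
    for i in range(0, H, p):
        for j in range(0, W, p):
            tokens.append([image[i + di][j + dj] for di in range(p) for dj in range(p)])
    return tokens

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 toy "image"
tokens = patchify(image, 2)
# 4 tokens of length 4; the first is the top-left 2 x 2 block: [0, 1, 4, 5]
```

Halving the patch size quadruples the token count, which is the main compute knob in this design.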

Result: DiT-XL/2 reported better FID than prior U-Net-based diffusion models on class-conditional ImageNet generation.


6. Training and Sampling Algorithms

6.1 Training Loop

# Pseudocode
while not converged:
    x_0 = sample_from_dataset()
    t = sample_uniform(1, T)          # integer timestep in {1, ..., T}
    epsilon = sample_normal(0, I)

    # Closed-form forward noising q(x_t | x_0)
    x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * epsilon
    loss = MSE(epsilon, epsilon_theta(x_t, t))

    optimizer.zero_grad()             # clear gradients from the previous step
    loss.backward()
    optimizer.step()

Training Tricks:

  • Timestep weighting: Weight loss by $1/\mathbb{E}[\|\epsilon\|^2]$ or use uniform weighting
  • Architecture: U-Net with attention, group normalization, SiLU activation
  • EMA: Exponential moving average of model weights for better sampling
  • Dropout: Apply to attention layers for regularization
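
The EMA trick is simple enough to sketch directly. The dictionary-of-floats parameter format and the `ema_update` name are assumptions for illustration; in a real framework the same update runs over every weight tensor after each optimizer step.

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place EMA: ema <- decay * ema + (1 - decay) * params."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]

params = {"w": 1.0}
ema = dict(params)            # start the average at the initial weights
for step in range(3):
    params["w"] += 1.0        # stand-in for an optimizer update
    ema_update(ema, params, decay=0.5)   # small decay so the effect is visible
```

Sampling then uses `ema` rather than `params`, since the averaged weights fluctuate less between steps.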

6.2 Sampling Loop (DDPM)

# Pseudocode
x = sample_normal(0, I)                        # x_T
for t in reversed(range(1, T + 1)):
    z = sample_normal(0, I) if t > 1 else 0    # no noise on the final step
    epsilon = epsilon_theta(x, t)
    # After this update, x holds x_{t-1}
    x = (x - (1 - alpha[t]) / sqrt(1 - alpha_bar[t]) * epsilon) / sqrt(alpha[t]) + sqrt(beta[t]) * z
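
The loop above can be run end to end on a toy problem without training anything. The sketch below assumes 1-D Gaussian data $x_0 \sim \mathcal{N}(M, S^2)$, for which the ideal noise prediction $\mathbb{E}[\epsilon \mid x_t]$ has a closed form; `eps_oracle` is a hypothetical stand-in for a trained $\epsilon_\theta$, and the linear schedule values are assumed defaults.

```python
import math
import random

T = 1000
betas = [1e-4 + (t - 1) * (0.02 - 1e-4) / (T - 1) for t in range(1, T + 1)]
alphas = [1.0 - b for b in betas]
alpha_bar, prod = [], 1.0
for a in alphas:
    prod *= a
    alpha_bar.append(prod)

M, S = 2.0, 0.5  # toy data distribution: x_0 ~ N(M, S^2)

def eps_oracle(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained eps_theta."""
    ab = alpha_bar[t - 1]
    var_t = ab * S * S + (1.0 - ab)          # marginal variance of x_t
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * M) / var_t

def ddpm_sample(rng):
    x = rng.gauss(0.0, 1.0)                  # x_T
    for t in range(T, 0, -1):
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0
        a, ab, b = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
        x = (x - (1.0 - a) / math.sqrt(1.0 - ab) * eps_oracle(x, t)) / math.sqrt(a) + math.sqrt(b) * z
    return x

rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(500)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

The sample mean and standard deviation should come out close to `M` and `S`, which is a useful sanity check before swapping in a learned model.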

6.3 Advanced Sampling Methods

| Method | Steps | Approach |
|---|---|---|
| DDPM | 1000 | Original stochastic sampling |
| DDIM | 50-100 | Deterministic, skip steps |
| [[DPM-Solver]] | 10-20 | ODE solver with adaptive steps |
| [[DPM-Solver]]++ | 10-15 | Improved stability |
| UniPC | 5-10 | Unified predictor-corrector |
| Consistency Models | 1-5 | Direct mapping, distillation |

Predictor-Corrector Framework (for [[Stochastic Differential Equation (SDE)|SDE]]-based models):

  1. Predictor: Take one step using reverse [[Stochastic Differential Equation (SDE)|SDE]]/ODE
  2. Corrector: Apply Langevin dynamics to refine sample
  3. Repeat: Alternate for better quality

# Predictor-Corrector Sampling (pseudocode; dt > 0 is the step size in reverse time)
for t in reversed(timesteps):
    # Predictor step (Euler-Maruyama on the reverse SDE)
    score = score_model(x, t)
    g = diffusion(t)
    x = x - (drift(x, t) - g**2 * score) * dt + g * sqrt(dt) * sample_normal(0, I)

    # Corrector step (Langevin)
    for _ in range(corrector_steps):
        score = score_model(x, t)
        x = x + step_size * score + sqrt(2 * step_size) * sample_normal(0, I)
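
The corrector step can be tried in isolation on a distribution whose score is known exactly. This sketch assumes a 1-D Gaussian target, so the lambda below replaces the learned score model; the step size and iteration count are illustrative choices.

```python
import math
import random

def langevin(x, score_fn, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + h * score(x) + sqrt(2h) * z."""
    for _ in range(n_steps):
        x = x + step_size * score_fn(x) + math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
    return x

MU, SIGMA = 1.0, 0.5
score = lambda x: -(x - MU) / SIGMA ** 2     # exact score of N(MU, SIGMA^2)

rng = random.Random(0)
# Start every chain from N(0, 1) and let the corrector pull it toward the target.
samples = [langevin(rng.gauss(0.0, 1.0), score, 0.01, 200, rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
```

With a finite step size the chain's stationary variance is slightly larger than the target's, which is one reason correctors use small steps.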

7. Advantages and Disadvantages

Advantages

  • High generation quality: Matches or exceeds GAN sample quality on standard image benchmarks
  • Stable training: A simple regression loss, with no adversarial game to balance
  • Elegant theory: Grounded in non-equilibrium thermodynamics and [[Stochastic Differential Equation (SDE)|SDE]] theory
  • Flexible conditioning: Easy to incorporate text, image, or other conditions
  • Coverage: Better mode coverage than GANs (less prone to mode collapse)
  • Likelihood estimation: Can compute exact likelihoods via the [[Probability Flow ODE]]

Disadvantages

  • Slow sampling speed: Requires tens to hundreds of iterative steps
  • Sensitive to hyperparameters: Noise schedule affects generation quality
  • High computational cost: Training requires significant resources
  • Blurriness: May produce blurry samples compared to GANs (in pixel space)

Acceleration Methods

Algorithm-Level:

  • DDIM (Deterministic sampling)
  • [[DPM-Solver]] (Ordinary differential equation solver)
  • Progressive Distillation
  • Consistency Models (One-step generation)

Architecture-Level:

  • Latent space diffusion (LDM)
  • Distilled models (smaller, faster)
  • Quantization and pruning

Hardware-Level:

  • GPU optimization
  • Parallel sampling
  • Mixed precision training

8. Applications in AI Image Generation

| Application | Representative Models | Features |
|---|---|---|
| Text-to-Image | DALL-E 2, Stable Diffusion, Imagen | Diffusion model + CLIP + Latent space |
| Image Editing | InstructPix2Pix, Prompt-to-Prompt | Conditional guided editing |
| Video Generation | Stable Video Diffusion | Adds a temporal dimension |
| 3D Generation | DreamFusion, Magic3D | Score Distillation Sampling (SDS) |
| Image Super-Resolution | SR3 | Diffusion + Denoising |
| Inpainting | RePaint, Stable Diffusion, GLIDE | Mask-guided generation |
| Style Transfer | StyleDrop, Custom Diffusion | Style adaptation |
| Controlled Generation | ControlNet, T2I-Adapter | Spatial control signals |

8.1 Text-to-Image Generation

Architecture:

  1. Text Encoder: CLIP, T5, or custom transformer
  2. Conditioning: Cross-attention in U-Net/DiT
  3. Diffusion: Latent space denoising
  4. Decoder: VAE decoder to pixel space

Training Data: LAION-5B, COCO, internal datasets

8.2 Image-to-Image Translation

Given a source image $x_{\text{src}}$ and a target description:

$$x_{\text{result}} = \text{Sample}(x_T \to x_0 \mid x_{\text{src}}, \text{text})$$

Methods:

  • Img2Img: Add noise to source, then denoise with conditioning
  • ControlNet: Copy and adapt U-Net weights for control
  • IP-Adapter: Image prompt adapter for visual conditioning
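
The Img2Img idea of "add noise to source, then denoise" amounts to starting the reverse process from an intermediate timestep. The sketch below is one way to express that; the `strength` parameter, the function name, and the toy schedule are assumptions for illustration, and the denoising loop itself is omitted.

```python
import math
import random

def img2img_start(x_src, strength, alpha_bar, rng):
    """Noise x_src to an intermediate timestep chosen by strength in [0, 1].

    alpha_bar[t - 1] holds alpha_bar_t for t = 1..T. Returns (t_start, x_t);
    denoising would then run from t_start down to 1 instead of from T.
    """
    T = len(alpha_bar)
    t_start = round(strength * T)
    if t_start == 0:
        return 0, list(x_src)  # no noise added, nothing to denoise
    ab = alpha_bar[t_start - 1]
    x_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for x in x_src]
    return t_start, x_t

alpha_bar = [0.9 ** t for t in range(1, 11)]   # toy schedule with T = 10
t_start, x_t = img2img_start([0.2, -0.4], strength=0.6, alpha_bar=alpha_bar, rng=random.Random(0))
```

Low strength keeps the result close to the source; strength near 1 discards most of it and behaves like plain text-to-image sampling.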

9. Conditional Diffusion Models

9.1 Classifier Guidance

$$\tilde\epsilon_\theta(x_t, t, c) = \epsilon_\theta(x_t, t) - \sqrt{1-\bar\alpha_t}\,\nabla_{x_t} \log p_\phi(c \mid x_t)$$

Pros:

  • Works with pre-trained classifiers
  • Flexible guidance strength

Cons:

  • Requires training separate classifier
  • Limited to classification conditions

9.2 Classifier-Free Guidance

$$\tilde\epsilon_\theta(x_t, t, c) = (1+w)\,\epsilon_\theta(x_t, t, c) - w\,\epsilon_\theta(x_t, t)$$

where $w > 0$ is the guidance strength.

Training: Randomly drop condition (e.g., 10% probability) during training to learn unconditional model.
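
Both halves of classifier-free guidance fit in a few lines. This is a minimal sketch on plain lists: the function names and the use of `None` as the null condition are assumptions, and in practice the two noise predictions come from the same network evaluated with and without the condition.

```python
import random

def cfg_combine(eps_cond, eps_uncond, w):
    """Guided prediction: (1 + w) * eps_cond - w * eps_uncond, elementwise."""
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

def maybe_drop_condition(cond, p_drop, rng, null_cond=None):
    """Training-time condition dropout: use a null token with probability p_drop."""
    return null_cond if rng.random() < p_drop else cond

eps_c, eps_u = [1.0, 2.0], [0.0, 1.0]
guided = cfg_combine(eps_c, eps_u, w=2.0)   # [3.0, 4.0]

rng = random.Random(0)
drops = sum(maybe_drop_condition("a cat", 0.1, rng) is None for _ in range(10000))
```

Setting `w=0` returns the plain conditional prediction; larger `w` pushes the sample further along the direction from the unconditional to the conditional estimate.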

Pros:

  • No separate classifier needed
  • Works with any condition type (text, image, etc.)
  • Better quality than classifier guidance

Cons:

  • Requires larger model (learns conditional + unconditional)
  • Guidance strength w needs tuning

9.3 Multi-Modal Conditioning

Modern diffusion models support multiple conditions:

| Condition Type | Encoding Method | Integration |
|---|---|---|
| Text | CLIP, T5 transformer | Cross-attention |
| Image | CLIP vision, VAE encoder | Concatenation, attention |
| Depth/Edges | CNN encoder | ControlNet, adapter |
| Pose/Skeleton | Graph neural network | Spatial injection |
| Audio | VGGish, CLAP | Cross-attention |

9.4 Controllability Methods

ControlNet:

  • Clone U-Net encoder layers
  • Train with zero convolution initialization
  • Lock original model, train control branches

IP-Adapter:

  • Add image encoder parallel to text encoder
  • Use decoupled cross-attention
  • Enables image prompt guidance

10. Theoretical Analysis

10.1 Connection to Variational Inference

Diffusion models optimize the variational lower bound (ELBO):

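$$\log p_\theta(x_0) \ge \mathbb{E}_q\left[\log p_\theta(x_0 \mid x_1) - \sum_{t=2}^{T} D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)\right]$$

Interpretation:

  • Term 1: Reconstruction loss
  • Term 2: Consistency between forward and reverse processes
  • Term 3: Prior matching (ensure $x_T$ is close to Gaussian)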

10.2 Connection to Score Matching

Score matching objective:

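$$J(\theta) = \frac{1}{2}\,\mathbb{E}_{p(x)}\left[\|s_\theta(x) - \nabla_x \log p(x)\|^2\right]$$

For diffusion models, this becomes denoising score matching:

$$J(\theta) = \frac{1}{2}\sum_{t=1}^{T} \mathbb{E}_{x_0, x_t}\left[\|s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0)\|^2\right]$$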

10.3 Neural Tangent Kernel (NTK) Analysis

In the infinite-width limit, diffusion model training can be analyzed via NTK:

  • Training dynamics: Governed by kernel regression
  • Generalization: Related to kernel eigenvalues
  • Mode coverage: Depends on data spectrum

10.4 Information Bottleneck Perspective

Forward diffusion as information bottleneck:

$$I(x_0; x_t) = \text{information about } x_0 \text{ preserved at time } t$$
  • Early timesteps: High mutual information (preserve details)
  • Late timesteps: Low mutual information (only semantic info)
  • Optimal schedule balances compression and preservation
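
For intuition, the mutual information has a closed form in one special case. The sketch below assumes scalar Gaussian data $x_0 \sim \mathcal{N}(0, \sigma^2)$, where the forward process is a Gaussian channel and $I(x_0; x_t) = \tfrac{1}{2}\log\left(1 + \frac{\bar\alpha_t \sigma^2}{1-\bar\alpha_t}\right)$. Real image data is not Gaussian, so this only illustrates the trend; the linear schedule values are assumed defaults.

```python
import math

def mutual_information_gaussian(alpha_bar_t, data_var=1.0):
    """I(x_0; x_t) in nats for scalar x_0 ~ N(0, data_var) under q(x_t | x_0)."""
    snr = alpha_bar_t * data_var / (1.0 - alpha_bar_t)   # signal-to-noise ratio
    return 0.5 * math.log(1.0 + snr)

T = 1000
betas = [1e-4 + (t - 1) * (0.02 - 1e-4) / (T - 1) for t in range(1, T + 1)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

info = [mutual_information_gaussian(ab) for ab in alpha_bar]
# info[0] is large (almost no noise); info[-1] is close to zero (almost pure noise)
```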

11. Core Formula Cards

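[!QUOTE] Reparameterization Noising

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$$

[!QUOTE] DDPM Loss

$$L_{\text{simple}} = \|\epsilon - \epsilon_\theta(x_t, t)\|^2$$

[!QUOTE] DDIM Sampling

$$x_{t-1} = \sqrt{\bar\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}\right) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t z$$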

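[!QUOTE] Classifier-Free Guidance

$$\tilde\epsilon = \epsilon_\theta(x_t, t, \varnothing) + w\,\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$$

In this form $w = 1$ recovers the conditional prediction; this $w$ corresponds to $1 + w$ in the form $(1+w)\,\epsilon_\theta(x_t, t, c) - w\,\epsilon_\theta(x_t, t)$ from Section 9.2.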

12. Evaluation Metrics

12.1 Sample Quality

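| Metric | Description | Range / Interpretation |
|---|---|---|
| FID | Fréchet Inception Distance | Lower is better (0 is perfect) |
| IS | Inception Score | Higher is better |
| Precision/Recall | Quality vs. diversity trade-off | [0, 1] |
| KID | Kernel Inception Distance | Lower is better |

FID Formula:

$$\text{FID} = \|\mu_r - \mu_g\|^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

where $\mu_r, \Sigma_r$ are real data statistics and $\mu_g, \Sigma_g$ are generated statistics.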
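
In one dimension the formula collapses to $(\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2$, which makes it easy to sanity-check. The sketch below is a toy on raw scalars with a hypothetical `fid_1d` helper; actual FID fits the Gaussians to Inception network features and needs a matrix square root.

```python
import math

def fid_1d(real, generated):
    """Fréchet distance between 1-D Gaussians fitted to two samples.

    In one dimension the trace term reduces to (sigma_r - sigma_g)^2.
    """
    def stats(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, var
    mu_r, var_r = stats(real)
    mu_g, var_g = stats(generated)
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)

real = [0.0, 1.0, 2.0, 3.0]
shifted = [x + 2.0 for x in real]      # same spread, mean moved by 2
same = fid_1d(real, real)              # identical statistics
moved = fid_1d(real, shifted)          # only the mean term contributes
```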

12.2 Likelihood Evaluation

Bits per dimension (bpd):

$$\text{bpd} = -\frac{\log_2 p_\theta(x)}{\dim(x)}$$

Lower bpd indicates better likelihood.
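
Frameworks usually report negative log-likelihood in nats, so a unit conversion is needed. A minimal sketch, with a hypothetical helper name and made-up input numbers:

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

# Hypothetical numbers: a 32 x 32 x 3 image with NLL = 6500 nats.
bpd = bits_per_dim(6500.0, 32 * 32 * 3)
```

Dividing by $\ln 2$ converts nats to bits, and dividing by the dimension count makes images of different sizes comparable.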

12.3 Diversity Metrics

  • Mode coverage: Percentage of data modes captured
  • LPIPS: Learned Perceptual Image Patch Similarity (diversity)
  • Unique samples: Ratio of unique generated samples

12.4 Human Evaluation

  • User studies: Preference ratings
  • Text-image alignment: CLIP score for text-to-image
  • Aesthetic quality: Aesthetic score predictors

13. Practical Implementation Tips

13.1 Network Architecture

U-Net Design:

Input
  ↓
Downsample Block 1 (128 channels)
  ↓
Downsample Block 2 (256 channels)
  ↓
Downsample Block 3 (512 channels)
  ↓
Middle Block with Attention (1024 channels)
  ↓
Upsample Block 3 (512 channels) + Skip Connection
  ↓
Upsample Block 2 (256 channels) + Skip Connection
  ↓
Upsample Block 1 (128 channels) + Skip Connection
  ↓
Output (3 channels)

Key Components:

  • ResNet blocks: Groups of 2-3 conv layers with skip connections
  • Attention: Self-attention at lowest resolution (e.g., 32x32)
  • Time embedding: Sinusoidal position encoding → MLP
  • Conditioning: Cross-attention for text, AdaGN for class labels
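
The time embedding bullet can be made concrete with a small sketch of the sinusoidal part. The function name, the sin-then-cos layout, and `max_period=10000` are assumptions following the common Transformer-style convention; the MLP that follows is omitted, and `dim` is assumed even.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep: [sin(t*w_i)..., cos(t*w_i)...]."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]

emb0 = timestep_embedding(0, 8)   # sin half is all zeros, cos half is all ones
emb5 = timestep_embedding(5, 8)
```

Nearby timesteps get similar vectors while distant ones differ, which gives the network a smooth notion of noise level.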

13.2 Training Best Practices

Hyperparameters:

| Parameter | Recommended Value | Notes |
|---|---|---|
| Timesteps | 1000 | Standard, can use fewer for fast sampling |
| Batch size | 256-512 | Larger is better if memory allows |
| Learning rate | 1e-4 | Use cosine decay schedule |
| Optimizer | AdamW | β₁=0.9, β₂=0.999 |
| EMA rate | 0.9999 | Exponential moving average |
| Gradient clipping | 1.0 | Prevents explosion |

Data Augmentation:

  • Random horizontal flip
  • Random crop and resize
  • No color jitter (changes data distribution)

13.3 Debugging Checklist

  • Check noise schedule: Plot $\bar\alpha_t$ vs $t$, ensure smooth decay
  • Monitor loss curves: Should decrease smoothly, no spikes
  • Validate reparameterization: $x_t$ should match theoretical distribution
  • Test sampling: Start with small model, verify basic functionality
  • Check gradients: Norm should be reasonable (< 10)
  • Visualize intermediates: Sample at different timesteps during training

13.4 Common Issues and Solutions

| Problem | Cause | Solution |
|---|---|---|
| Blurry samples | Undertraining, high noise | Train longer, check schedule |
| Mode collapse | Low capacity, overfitting | Increase model size, add dropout |
| Training instability | High learning rate | Reduce LR, add gradient clipping |
| Slow sampling | Too many timesteps | Use DDIM, [[DPM-Solver]] |
| Poor conditioning | Weak guidance | Increase guidance strength $w$ |

14. Recent Advances (2023-2024)

14.1 Consistency Models

Key Idea: Learn direct mapping from noise to data in one step.

$$f_\theta(x_t, t) \approx x_0 \quad \forall t$$

Benefits:

  • 1-step generation (vs. 1000 steps)
  • Distillation from pre-trained diffusion models
  • Competitive quality with fewer steps

14.2 Rectified Flows

Concept: Learn straight trajectories between noise and data.

$$\frac{dx}{dt} = v_\theta(x, t)$$

where $v_\theta$ is a velocity field pointing from $x_1$ (noise) to $x_0$ (data).

Advantage: Fewer integration steps needed.
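
The reason straight trajectories need fewer steps is that Euler integration is exact along a straight line. The sketch below assumes a single noise/data pair and parametrizes the path by $s \in [0, 1]$ running from noise to data (an assumed convention); the constant field `straight` is the ideal target a rectified flow would try to learn, not a trained model.

```python
def euler_integrate(x_start, velocity_fn, n_steps):
    """Integrate dx/ds = v(x, s) from s = 0 to s = 1 with fixed Euler steps."""
    x, h = list(x_start), 1.0 / n_steps
    for k in range(n_steps):
        v = velocity_fn(x, k * h)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x

x1 = [0.3, -1.2]   # noise sample
x0 = [2.0, 0.5]    # data sample

# Ideal straight-line field for this pair: constant velocity x0 - x1.
straight = lambda x, s: [a - b for a, b in zip(x0, x1)]

one_step = euler_integrate(x1, straight, 1)
ten_steps = euler_integrate(x1, straight, 10)
```

Both runs end at `x0` up to floating-point error, so the extra nine steps buy nothing when the path is straight. Curved paths, as in standard diffusion ODEs, lose accuracy with coarse steps.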

14.3 Diffusion Transformers (DiT)

  • Replace U-Net with Transformer
  • Scale to billions of parameters
  • Better performance with larger models
  • Used in Sora, Stable Diffusion 3

14.4 [[Flow Matching]]

General framework encompassing diffusion models:

$$\frac{dx}{dt} = u_t(x)$$

where $u_t(x)$ is a learned vector field.

Unifies:

  • Diffusion models
  • Normalizing flows
  • Continuous normalizing flows

14.5 Video Diffusion Models

Challenges:

  • Temporal consistency
  • Computational cost (3D + time)
  • Long sequence generation

Solutions:

  • Spatiotemporal attention
  • Cascaded generation
  • Latent video diffusion

15. Comparison with Other Generative Models

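| Model | Quality | Diversity | Training Stability | Sampling Speed | Likelihood |
|---|---|---|---|---|---|
| GAN | ★★★★☆ | ★★☆☆☆ | ★☆☆☆☆ | ★★★★★ | |
| VAE | ★★☆☆☆ | ★★★★☆ | ★★★★★ | ★★★★★ | |
| Diffusion | ★★★★★ | ★★★★★ | ★★★★★ | ★★☆☆☆ | |
| Flow | ★★★☆☆ | ★★★★☆ | ★★★★☆ | ★★★☆☆ | |
| EBM | ★★★★☆ | ★★★☆☆ | ★★☆☆☆ | ★☆☆☆☆ | |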

When to use Diffusion Models:

  • ✓ Need high-quality samples
  • ✓ Mode coverage is important
  • ✓ Training stability is critical
  • ✗ Real-time generation required
  • ✗ Limited computational resources

  • [[Wiener Process|Wiener Process]]
  • [[Stochastic Differential Equation (SDE)|SDE]]
  • [[Score Function]]
  • [[Probability Flow ODE]]
  • [[Fokker-Planck Equation]]
  • [[Kolmogorov Equations]]
  • [[DDIM]]
  • [[DPM-Solver]]
  • [[Flow Matching]]
  • [[Markov Process]]
  • [[Neural ODE]]
  • [[ResNet]]
  • [[U-Net]]
  • [[DiT]]
  • [[Vision Transformer (ViT)]]
  • [[Variational Autoencoder (VAE)]]
  • [[Generative Adversarial Network (GAN)]]
  • [[Langevin Dynamics]]
  • [[Denoising Score Matching]]
  • [[Itô Integral]]
  • [[Martingale]]

Dataview Query

LIST
FROM #diffusion_model
SORT file.ctime DESC

References

  • Paper: Denoising Diffusion Probabilistic Models (Ho et al., 2020)
  • Paper: Score-Based Generative Modeling through SDEs (Song et al., 2021)
  • Paper: High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022)
  • Blog: What are Diffusion Models? - Lilian Weng
  • Course: CS236 Deep Generative Models (Stanford)